6 research outputs found

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule to its biological property in drug design and toxicological studies. In this body of work, I started with two in-depth reviews of the application of machine learning-based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning-based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, combined with bagging as an ensemble strategy, were evaluated. Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. Second, deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR-based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from a multi-head self-attention mechanism. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features correlate with biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning-based Quantitative Structure-Activity/Property Relationship (QSAR) efforts for enhanced drug discovery and toxicology assessments.

    Deep Learning-Based Structure-Activity Relationship Modeling for Multi-Category Toxicity Classification: A Case Study of 10K Tox21 Chemicals With High-Throughput Cell-Based Androgen Receptor Bioassay Data

    Deep learning (DL) has attracted the attention of computational toxicologists as it offers potentially greater power for in silico predictive toxicology than existing shallow learning algorithms. However, contradictory reports have been documented. To further explore the advantages of DL over shallow learning, we conducted this case study using two cell-based androgen receptor (AR) activity datasets with 10K chemicals generated from the Tox21 program. A nested double-loop cross-validation approach was adopted along with a stratified sampling strategy for partitioning chemicals of multiple AR activity classes (i.e., agonist, antagonist, inactive, and inconclusive) at the same distribution rates amongst the training, validation and test subsets. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22–27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Further in-depth analyses of chemical scaffolding shed light on structural alerts for AR agonists/antagonists and inactive/inconclusive compounds, which may aid in future drug discovery and improvement of toxicity prediction modeling.
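The partitioning scheme described in this abstract (stratified sampling inside a nested double-loop cross-validation) can be illustrated with a small standard-library sketch. It is not the authors' code: the function names `stratified_folds` and `nested_splits`, the fold counts (5 outer, 4 inner), and the toy class counts are all assumptions made for illustration, since the abstract does not state them.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Split sample positions into folds that keep each class's share roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the members of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def nested_splits(labels, outer=5, inner=4, seed=0):
    """Yield (train, validation, test) index lists for a nested double-loop CV."""
    outer_folds = stratified_folds(labels, outer, seed)
    for o, test in enumerate(outer_folds):
        # Outer loop: one fold is held out as the test subset.
        rest = [i for f in outer_folds[:o] + outer_folds[o + 1:] for i in f]
        rest_labels = [labels[i] for i in rest]
        # Inner loop: the remainder is split again into training and validation.
        for val_positions in stratified_folds(rest_labels, inner, seed + o + 1):
            val = [rest[p] for p in val_positions]
            val_set = set(val)
            train = [i for i in rest if i not in val_set]
            yield train, val, test


# Made-up class counts, for illustration only (not the Tox21 distribution).
labels = (["inactive"] * 60 + ["inconclusive"] * 20
          + ["antagonist"] * 12 + ["agonist"] * 8)
splits = list(nested_splits(labels))
print(len(splits), "train/validation/test partitions")
```

Each partition keeps all four activity classes present in the training, validation and test subsets. The round-robin dealing is a simplification that lets the first folds end up slightly larger than the last ones.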

    A Review On Machine Learning Methods For In Silico Toxicity Prediction

    In silico toxicity prediction plays an important role in regulatory decision making and the selection of leads in drug design, as in vitro/vivo methods are often limited by ethics, time, budget, and other resources. Many computational methods have been employed in predicting the toxicity profile of chemicals. This review provides a detailed end-to-end overview of the application of machine learning algorithms to Structure-Activity Relationship (SAR)-based predictive toxicology. From raw data to model validation, the importance of data quality is stressed as it greatly affects the predictive power of derived models. Commonly overlooked challenges such as data imbalance, activity cliff, model evaluation, and definition of applicability domain are highlighted, and plausible solutions for alleviating these challenges are discussed.

    Target-Specific Toxicity Knowledgebase (TsTKb): A Novel Toolkit for In Silico Predictive Technology

    As the number of man-made chemicals increases at an unprecedented pace, efforts to quickly screen and accurately evaluate their potential adverse biological effects have been hampered by the prohibitively high costs of in vivo/vitro toxicity testing. While it is unrealistic and unnecessary to test every uncharacterized chemical, it remains a major challenge to develop alternative in silico tools with high reliability and precision in toxicity prediction. To address this urgent need, we have developed a novel mode-of-action-guided, molecular modeling-based, and machine learning-enabled modeling approach for in silico chemical toxicity prediction. Here we introduce the core element of this approach, Target-specific Toxicity Knowledgebase (TsTKb), which consists of two main components: a Chemical Mode of Action (ChemMoA) database and a suite of prediction model libraries.

    Structure–Activity Relationship-Based Chemical Classification of Highly Imbalanced Tox21 Datasets

    The specificity of toxicant-target biomolecule interactions contributes to the very imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlapping and a higher false prediction rate. In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for > 10,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which F1 score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. We also found that a strong negative correlation existed between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.
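To make the rebalancing idea concrete, the following is a minimal standard-library sketch of the imbalance ratio and of a SMOTE-then-ENN pass on toy two-dimensional data. It is not the study's implementation: the helper names, the neighbour counts (`k=5` for SMOTE, `k=3` for ENN), the majority-vote cleaning rule, and the random toy points are assumptions chosen for illustration, and the Random Forest and bagging steps are left out.

```python
import math
import random
from collections import Counter


def imbalance_ratio(y):
    """IR as defined in the abstract: inactive count (0) / active count (1)."""
    counts = Counter(y)
    return counts[0] / counts[1]


def k_nearest(i, X, k):
    """Indices of the k points closest to X[i] (Euclidean), excluding i itself."""
    others = (j for j in range(len(X)) if j != i)
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]


def smote(X, y, k=5, seed=0):
    """Add interpolated active (label 1) points until both classes are equal in size."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    n_new = sum(1 for label in y if label == 0) - len(minority)
    k = min(k, len(minority) - 1)  # needs at least two minority points
    synthetic = []
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(i, minority, k))
        gap = rng.random()
        # New point lies on the segment between a minority point and a neighbour.
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return list(X) + synthetic, list(y) + [1] * len(synthetic)


def enn(X, y, k=3):
    """Drop every point whose label disagrees with the majority of its k neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in k_nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 40 inactive points near (0, 0) and 4 active points near (3, 3).
rng = random.Random(42)
X = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
     + [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(4)])
y = [0] * 40 + [1] * 4

X_s, y_s = smote(X, y)   # oversample the active class by interpolation
X_c, y_c = enn(X_s, y_s)  # then clean points that sit in the wrong neighbourhood
print(imbalance_ratio(y), imbalance_ratio(y_s), len(X_c))
```

On this toy set the IR drops from 10.0 to 1.0 after oversampling, and the ENN pass can only shrink the set, removing points whose neighbours mostly belong to the other class. Library implementations differ in details such as the neighbour-agreement rule, so the sketch should be read as the general idea rather than a drop-in replacement.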

    Mode-of-Action-Guided, Molecular Modeling-Based Toxicity Prediction: A Novel Approach for In Silico Predictive Toxicology

    Computational toxicology is a sub-discipline of toxicology concerned with the development and use of computer-based models and methodology to understand and predict chemical toxicity in a biological system (e.g., cells and organisms). Quantitative structure–activity relationship (QSAR) has been the predominant approach in computational toxicology. However, classical QSAR methodology has often suffered from low prediction accuracy, largely owing to the lack or non-integration of toxicological mechanisms. To address this lingering problem, we have developed a novel in silico toxicology approach that is based on molecular modeling and guided by mode of action (MoA). Our approach is implemented through a target-specific toxicity knowledgebase (TsTKb), consisting of a pre-categorized database of chemical MoA (ChemMoA) and a series of pre-built, category-specific classification and quantification models. ChemMoA serves as the depository of chemicals with known MoAs or molecular initiating events (i.e., known target biomacromolecules) and quantitative information for measured toxicity endpoints (if available). The models allow a user to qualitatively classify an uncharacterized chemical by MoA and quantitatively predict its toxicity potency. This approach is currently under development and will evolve to incorporate physiologically based pharmacokinetic (PBPK) modeling to address absorption, distribution, metabolism and excretion (ADME) processes in a biological system. The fully developed approach is expected to significantly advance in silico-based predictive toxicology and provide a powerful new toolbox for regulators, the chemical industry and the relevant academic communities.